Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Historically, information theory was developed by Claude E. Shannon to find fundamental limits on compressing and reliably storing and communicating data. Since its inception it has broadened to find applications in many other areas, including statistical inference, natural language processing, cryptography, networks other than communication networks (as in neurobiology), the evolution and function of molecular codes, model selection in ecology, thermal physics, quantum computing, plagiarism detection, and other forms of data analysis.
A key measure of information in the theory is known as entropy, which is usually expressed by the average number of bits needed for storage or communication. Intuitively, entropy quantifies the uncertainty involved when encountering a random variable. For example, a fair coin flip (2 equally likely outcomes) will have less entropy than a roll of a die (6 equally likely outcomes).
Applications of fundamental topics of information theory include lossless data compression (e.g. ZIP files), lossy data compression (e.g. MP3s), and channel coding (e.g. for DSL lines). The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering. Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the compact disc, the feasibility of mobile phones, the development of the Internet, the study of linguistics and of human perception, the understanding of black holes, and numerous other fields. Important sub-fields of information theory are source coding, channel coding, algorithmic complexity theory, algorithmic information theory, and measures of information.
Overview

The main concepts of information theory can be grasped by considering the most widespread means of human communication: language. Two important aspects of a good language are as follows. First, the most common words (e.g. "a", "the", "I") should be shorter than less common words (e.g. "benefit", "generation", "mediocre"), so that sentences will not be too long. Such a tradeoff in word length is analogous to data compression and is the essential aspect of source coding. Second, if part of a sentence is unheard or misheard due to noise (e.g. a passing car), the listener should still be able to glean the meaning of the underlying message. Such robustness is as essential for an electronic communication system as it is for a language; properly building such robustness into communications is done by channel coding. Source coding and channel coding are the fundamental concerns of information theory.
Note that these concerns have nothing to do with the importance of messages. For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea "Call an ambulance!", while clearly the latter is more important and more meaningful. Information theory, however, does not consider message importance or meaning, as these are matters of the quality of data rather than the quantity and readability of data, the latter of which is determined solely by probabilities.
Information theory is generally considered to have been founded in 1948 by Claude Shannon in his seminal work, "A Mathematical Theory of Communication". The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel. The most fundamental results of this theory are Shannon's source coding theorem, which establishes that, on average, the number of bits needed to represent the result of an uncertain event is given by its entropy; and Shannon's noisy-channel coding theorem, which states that reliable communication is possible over noisy channels provided that the rate of communication is below a certain threshold called the channel capacity. The channel capacity can be approached in practice by using appropriate encoding and decoding systems.
Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half century or more: adaptive systems, anticipatory systems, artificial intelligence, complex systems, complexity science, cybernetics, informatics, machine learning, along with systems sciences of many descriptions. Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of coding theory.
Coding theory is concerned with finding explicit methods, called codes, of increasing the efficiency and reducing the net error rate of data communication over a noisy channel to near the limit that Shannon proved is the maximum possible for that channel. These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. In the latter case, it took many years to find the methods Shannon's work proved were possible. A third class of information theory codes are cryptographic algorithms (both codes and ciphers). Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis. See the article ban (information) for a historical application.
Information theory is also used in information retrieval, intelligence gathering, gambling, statistics, and even in musical composition.
Historical background

Main article: History of information theory
The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October of 1948.
Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. Harry Nyquist's 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation $W = K \log m$, where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish one sequence of symbols from any other, thus quantifying information as $H = \log S^n = n \log S$, where S was the number of possible symbols, and n the number of symbols in a transmission. The natural unit of information was therefore the decimal digit, much later renamed the hartley in his honour as a unit or scale or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German Second World War Enigma ciphers.
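As a small worked example (the alphabet size and message length are our own illustration, not Hartley's), Hartley's measure can be computed directly in Python:

```python
import math

# Hartley's measure of information: H = log(S^n) = n log S.
# Taken to base 10, the unit is the decimal digit, later named the hartley.
S = 26   # number of possible symbols (hypothetical: a 26-letter alphabet)
n = 10   # number of symbols in the transmission
H = n * math.log10(S)
print(H)  # about 14.15 hartleys
```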
Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory.
In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that
"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point."
With it came the ideas of
the information entropy and redundancy of a source, and its relevance through the source coding theorem;
the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem;
the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; as well as
the bit, a new way of seeing the most fundamental unit of information.
Quantities of information

Main article: Quantities of information
Information theory is based on probability theory and statistics. The most important quantities of information are entropy, the information in a random variable, and mutual information, the amount of information in common between two random variables. The former quantity indicates how easily message data can be compressed, while the latter can be used to find the communication rate across a channel.
The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, based on the binary logarithm. Other units include the nat, which is based on the natural logarithm, and the hartley, which is based on the common logarithm.
In what follows, an expression of the form $p \log p$ is considered by convention to be equal to zero whenever $p = 0$. This is justified because $\lim_{p \to 0^+} p \log p = 0$ for any logarithmic base.
Entropy

Figure: Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function, $H_\mathrm{b}(p)$. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.
The entropy, H, of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X.
Suppose one transmits 1000 bits (0s and 1s). If these bits are known ahead of transmission (to be a certain value with absolute probability), logic dictates that no information has been transmitted. If, however, each is equally and independently likely to be 0 or 1, 1000 bits (in the information theoretic sense) have been transmitted. Between these two extremes, information can be quantified as follows. If $\mathbb{X}$ is the set of all messages $x$ that $X$ could be, and $p(x)$ is the probability of some $x \in \mathbb{X}$, then the entropy of $X$ is defined:

$H(X) = \mathbb{E}_{X}[I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x).$

(Here, $I(x)$ is the self-information, which is the entropy contribution of an individual message, and $\mathbb{E}_X$ is the expected value.) An important property of entropy is that it is maximized when all the messages in the message space are equiprobable, $p(x) = 1/n$, i.e., most unpredictable, in which case $H(X) = \log n$.
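To make the definition concrete, here is a minimal Python sketch (the helper name `entropy` and the example distributions are ours, not from the source) computing H(X) for the fair coin and fair die mentioned earlier:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum of p(x) * log p(x); terms with p = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair six-sided die: log2(6), about 2.585 bits
print(entropy([1.0]))        # a certain outcome carries 0 bits
```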
The special case of information entropy for a random variable with two outcomes is the binary entropy function:

$H_\mathrm{b}(p) = -p \log p - (1 - p) \log (1 - p).$
Joint entropy
The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies.
For example, if (X, Y) represents the position of a chess piece, with X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece.

$H(X, Y) = \mathbb{E}_{X,Y}[-\log p(x, y)] = -\sum_{x, y} p(x, y) \log p(x, y)$

Despite similar notation, joint entropy should not be confused with cross entropy.
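A small sketch of the chessboard example (the function name and the uniform-position assumption are ours):

```python
import math

def joint_entropy(pxy):
    """H(X, Y) = -sum over (x, y) of p(x, y) * log2 p(x, y)."""
    return -sum(p * math.log2(p) for p in pxy.values() if p > 0)

# A piece placed uniformly on an 8x8 board: H(X, Y) = log2(64) = 6 bits,
# i.e. 3 bits for the row plus 3 bits for the column, since X and Y are independent.
board = {(row, col): 1/64 for row in range(8) for col in range(8)}
print(joint_entropy(board))  # 6.0
```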
Conditional entropy (equivocation)
The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y) is the average conditional entropy over Y:

$H(X \mid Y) = \mathbb{E}_Y[H(X \mid y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x \mid y) \log p(x \mid y) = -\sum_{x, y} p(x, y) \log p(x \mid y).$

Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:

$H(X \mid Y) = H(X, Y) - H(Y).$
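A quick numerical check of this identity, using a toy joint distribution of our own choosing:

```python
import math

def H(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y), chosen only for illustration.
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}

# Marginal p(y), obtained by summing the joint distribution over x.
py = {}
for (x, y), p in pxy.items():
    py[y] = py.get(y, 0.0) + p

# H(X|Y) = H(X, Y) - H(Y) = 1.5 - 1.0 = 0.5 bits.
print(H(pxy.values()) - H(py.values()))
```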
Mutual information (transinformation)
Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of X relative to Y is given by:

$I(X; Y) = \mathbb{E}_{X,Y}[\mathrm{SI}(x, y)] = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$

where SI (Specific mutual Information) is the pointwise mutual information.
A basic property of the mutual information is that

$I(X; Y) = H(X) - H(X \mid Y).$

That is, knowing Y, we can save an average of $I(X; Y)$ bits in encoding X compared to not knowing Y.
Mutual information is symmetric:

$I(X; Y) = I(Y; X) = H(X) + H(Y) - H(X, Y).$

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) of the posterior probability distribution of X given the value of Y to the prior distribution on X:

$I(X; Y) = \mathbb{E}_{p(y)}\big[D_{\mathrm{KL}}\big(p(X \mid Y = y) \,\|\, p(X)\big)\big].$

In other words, this is a measure of how much, on the average, the probability distribution on X will change if we are given the value of Y. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

$I(X; Y) = D_{\mathrm{KL}}\big(p(X, Y) \,\|\, p(X)\, p(Y)\big).$

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and it has a well-specified asymptotic distribution.
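A minimal sketch (the function name and toy distribution are ours) computing I(X;Y) directly from the definition; it agrees with H(X) − H(X|Y) for the distribution used in the conditional entropy sketch above:

```python
import math

def mutual_information(pxy):
    """I(X; Y) = sum over (x, y) of p(x, y) * log2( p(x, y) / (p(x) p(y)) )."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in pxy.items() if p > 0)

# Same toy distribution as in the conditional entropy sketch:
# I(X; Y) = H(X) - H(X|Y) = 0.811 - 0.5, about 0.311 bits.
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
print(mutual_information(pxy))
```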
Kullback–Leibler divergence (information gain)
The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined:

$D_{\mathrm{KL}}\big(p(X) \,\|\, q(X)\big) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.$

Although it is sometimes used as a "distance metric", it is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric).
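A small sketch (the names and distributions are ours) of the extra bits paid when coding with the wrong distribution, and of the asymmetry that disqualifies the divergence as a metric:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum of p(x) * log2( p(x) / q(x) ), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # the "true" distribution
q = [0.5, 0.5]   # the distribution the code was designed for
print(kl_divergence(p, q))  # about 0.531 extra bits per datum
print(kl_divergence(q, p))  # about 0.737 bits: the divergence is not symmetric
```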
Other quantities
Other important information theoretic quantities include Rényi entropy (a generalization of entropy), differential entropy (a generalization of quantities of information to continuous distributions), and the conditional mutual information.
Coding theory

Main article: Coding theory

Figure: Scratches on the readable surface of a CD-R. Music and data CDs are coded using error correcting codes and thus can still be read even if they have minor scratches, using error detection and correction.
Coding theory is one of the most important and direct applications of information theory. It can be subdivided into source coding theory and channel coding theory. Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
Data compression (source coding): There are two formulations for the compression problem:
lossless data compression: the data must be reconstructed exactly;
lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. This subset of information theory is called rate–distortion theory.
Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error-correcting code adds just the right kind of redundancy (i.e., error correction) needed to transmit the data efficiently and faithfully across a noisy channel.
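As a toy illustration of channel coding (this triple-repetition scheme is our example; practical codes are far more efficient), redundancy is added so that a single flipped bit per block can be corrected:

```python
def encode(bits):
    """Triple-repetition code: transmit each data bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(received):
    """Majority vote over each block of three received bits."""
    return [int(sum(received[i:i + 3]) >= 2)
            for i in range(0, len(received), 3)]

codeword = encode([1, 0, 1])   # [1, 1, 1, 0, 0, 0, 1, 1, 1]
codeword[1] ^= 1               # the noisy channel flips one bit
print(decode(codeword))        # [1, 0, 1]: the single error is corrected
```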
This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems, which justify the use of bits as the universal currency for information in many contexts. However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. Network information theory refers to these multi-agent communication models.
Source theory
Any process that generates successive messages can be considered a source of information. A memoryless source is one in which each message is an independent identically distributed random variable, whereas the properties of ergodicity and stationarity impose more general constraints. All such sources are stochastic. These terms are well studied in their own right outside information theory.
Rate
Information rate is the average entropy per symbol. For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is

$r = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, X_{n-3}, \ldots);$

that is, the conditional entropy of a symbol given all the previous symbols generated. For the more general case of a process that is not necessarily stationary, the average rate is

$r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n);$

that is, the limit of the joint entropy per symbol. For stationary sources, these two expressions give the same result.
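For a stationary Markov source, the limit reduces to the expected conditional entropy of the next symbol given the current state; a minimal sketch with a two-state chain of our own choosing:

```python
import math

def Hb(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Two-state Markov source (our example): from state 0 the symbol flips with
# probability 0.1, from state 1 with probability 0.3. The stationary
# distribution solving pi = pi * P is pi = (0.75, 0.25), so the entropy rate is
# r = pi_0 * Hb(0.1) + pi_1 * Hb(0.3).
rate = 0.75 * Hb(0.1) + 0.25 * Hb(0.3)
print(rate)  # about 0.572 bits per symbol
```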
It is common in information theory to speak of the "rate" or "entropy" of a language. This is appropriate, for example, when the source of information is English prose. The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding.
Channel capacity
Main article: Noisy-channel coding theorem
Communications over a channel, such as an ethernet wire, is the primary motivation of information theory. As anyone who's ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality. How much information can one hope to communicate over a noisy (or otherwise imperfect) channel?
Consider the communications process over a discrete channel. A simple model of the process is shown below:

Figure: The channel model; a transmitted message X passes through a noisy channel characterized by p(y|x) to yield the received message Y.
Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over our channel. Let p(y|x) be the conditional probability distribution function of Y given X. We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). Then the joint distribution of X and Y is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel. The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by:

$C = \max_{f} I(X; Y).$
This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol). For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error.
Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity.
Channel capacity of particular model channels
A continuous-time analog communications channel subject to Gaussian noise: see Shannon–Hartley theorem.
A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p. The BSC has a capacity of $1 - H_\mathrm{b}(p)$ bits per channel use, where $H_\mathrm{b}$ is the binary entropy function.
A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. The erasure represents complete loss of information about an input bit. The capacity of the BEC is $1 - p$ bits per channel use.
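A minimal sketch (the function names are ours) computing these two capacities from the formulas above:

```python
import math

def Hb(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Binary symmetric channel with crossover probability p: C = 1 - Hb(p)."""
    return 1 - Hb(p)

def bec_capacity(p):
    """Binary erasure channel with erasure probability p: C = 1 - p."""
    return 1 - p

print(bsc_capacity(0.11))   # about 0.5 bits per channel use
print(bec_capacity(0.25))   # 0.75 bits per channel use
```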

Applications to other fields

Intelligence uses and secrecy applications
Information theoretic concepts apply to cryptography and cryptanalysis. Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of WWII in Europe. Shannon himself defined an important concept now called the unicity distance. Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability.
Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time.
Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material.
Pseudorandom number generation
Pseudorandom number generators are widely available in computer language libraries and application programs. They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software to work as intended. These can be obtained via extractors, if done carefully. The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptography uses.
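A small sketch (the distribution is our example) of why high Shannon entropy does not guarantee high min-entropy, the quantity extractors require:

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_entropy(probs):
    """H_min = -log2(max p): determined entirely by the most likely outcome."""
    return -math.log2(max(probs))

# A skewed distribution (our example): one outcome has probability 0.5,
# fifteen others share the rest. Shannon entropy looks healthy, but an
# extractor sees only 1 bit of min-entropy.
probs = [0.5] + [0.5 / 15] * 15
print(shannon_entropy(probs))  # about 2.95 bits
print(min_entropy(probs))      # 1.0 bit
```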
Seismic exploration
One early commercial application of information theory was in the field of seismic oil exploration. Work in this field made it possible to strip off and separate the unwanted noise from the desired seismic signal. Information theory and digital signal processing offer a major improvement of resolution and image clarity over previous analog methods.
Miscellaneous applications
Information theory also has applications in gambling and investing, black holes, bioinformatics, and music.